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1.1 Introduction 



The reversible jump Markov chain Monte Carlo sampler (IGreeru . Il995l ) provides a general 
framework for Markov chain Monte Carlo (MCMC) simulation in which the dimension of the 
parameter space can vary between iterates of the Markov chain. The reversible jump sampler 
can be viewed as an extension of the Metropolis-Hastings algorithm onto more general state 
spaces. 

To understand this in a Bayesian modelling context, suppose that for observed data x we 
have a countable collection of candidate models M. = {A^i, A^2, • • •} indexed by a parameter 
k E K,. The index k can be considered as an auxiliary model indicator variable, such that 
M-k' denotes the model where k = k'. Each model Aik has an rifc-dimensional vector of 
unknown parameters. Ok G TZ"''', where Uk can take different values for different models 
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/c G /C. The joint posterior distribution of {k, Ok) given observed data, x, is obtained as 
the product of the hkehhood, L{x \ k,Ok), and the joint prior, p{k,Ok) = p{Ok I k)p{k), 
constructed from the prior distribution of dk under model A4k, and the prior for the model 
indicator k (i.e. the prior for model Aik)- Hence the joint posterior is 



7i{k,ek I x) = 



L{x 


k^OMOk 


k)p{k) 




X 


k', 0'kMOk' 


k')p{k')de' k, 



(1.1.1) 



The reversible jump algorithm uses the joint posterior distribution in Equation fll.l.ip as the 
target of a Markov chain Monte Carlo sampler over the state space = Ufcec({^} ^ T^"'''), 
where the states of the Markov chain are of the form {k,6k), the dimension of which can 
vary over the state space. Accordingly, from the output of a single Markov chain sampler, 
the user is able to obtain a full probabilistic description of the posterior probabilities of 
each model having observed the data, x, in addition to the posterior distributions of the 
individual models. 

This article aims to provide an overview of the reversible jump sampler. We will outline 
the sampler's theoretical underpinnings, present the latest and most popular techniques for 
enhancing algorithm performance, and discuss the analysis of sampler output. Through the 
use of numerous worked examples it is hoped that the reader will gain a broad appreciation 
of the issues involved in multi-model simulation, and the confidence to implement reversible 
jump samplers in the course of their own studies. 



1.1.1 Prom Metropolis-Hastings to reversible jump 



The standard formulation of the Metropolis-Hastings algorithm (jHastingsl . Il970l ) relies on 
the construction of a time-reversible Markov chain via the detailed balance condition. This 
condition means that moves from state 6 to 0' are made as often as moves from 6' to 
with respect to the target density. This is a simple way to ensure that the equilibrium 
distribution of the chain is the desired target distribution. The extension of the Metropolis- 
Hastings algorithm to the setting where the dimension of the parameter vector varies is more 
challenging theoretically, however the resulting algorithm is surprisingly simple to follow. 
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For the construction of a Markov chain on a general state space with invariant or 
stationary distribution vr, the detailed balance condition can be written as 



7r{de)p{e,de') = n{de')p{e\de) (1.1.2) 

(0,6»')e^xB J{e,0')£AxB 



for al 



mm) 



Borel sets A x B G @, where P is a general Markov transition kernel (e.g. iGreen 



As with the standard Metropolis-Hastings algorithm, Markov chain transitions from a 
current state = {k, 0^) G ^ in model A^^ are realised by first proposing a new state 
d' = {k', 6k>) G in model Aik' from a proposal distribution q{0, 6'). The detailed balance 
condition fll.l.2p is enforced through the acceptance probability, where the move to the 
candidate state 6' is accepted with probability a(0, 0') . If rej e cted, the ch ain remains at the 



current state in model M.k- Under this mechanism (IGreen 
becomes 



2001 



2003|), Equation ffTX^ 



I 71(6 I x)q{0, e')a{e, 0')d0d0' = [ tx{0' \ x)q{0\ 0)a{0' , 0)d0d0', 

(1.1.3) 

where the distributions 7i{0 \ x) and 7r{0' \ x) are posterior distributions with respect to 
model A4k and A4k' respectively. 

One way to enforce Equation (I1.1.3P is by setting the acceptance probability as 



ff) f)'^ ■ )\ <G\x)q{0,0') 



:i.i.4) 



where a{0\ 
ratio ( IGreen 



is similarly de f ined. This resembles the usual Metropolis-Hastings acceptance 



1995 



Tierney 



19981 ). It is straightforward to observe that this formulation 



includes the standard Metropolis-Hastings algorithm as a special case. 

Accordingly, a reversible jump sampler with iterations is commonly constructed as: 



Step 1: Initialise k and 0k at iteration t = 1. 
Step 2: For iteration t > 1 perform 
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— Within-model move: with a fixed model k, update the parameters 9^ according to 
any MCMC updating scheme. 

— Between-models move: simultaneously update model indicator k and the parame- 
ters 0k according to the general reversible proposal/acceptance mechanism (Equa- 
tion [IXID. 

Step 3: Increment iteration t = t + 1. If t < A^, go to Step 2. 



1.1.2 Application areas 



Statistical problems in which the number of unknown model parameters is itself unknown 
are extensive, and as such the reversible jump sampler has been implemented in analyses 
throughout a wide range of scientific disciplines over the last 15 years. Within the statistical 



literature, these predominantly concern Bayesian model determination problems (jSisson 
20051 ). Some of the commonly recurring models in this setting are described below. 



Change-point models: One of the original applications of the reversible jump sampler 
was in Bayesian change-point problems, where both the numbe r and l ocation of change- 
points in a system is unknown a priori. For example, iGreeru (119951 ) analysed mining 



disaster count data using a Poisson process with the rate pa rameter described as a 
step function with an unknown number and location of steps. iFan and Brooks! (120001 ) 
applied the reversible jump sampler to model the shape of prehistoric tombs, where the 
curvature of the dome changes an unknown number of times. Figure 11.1( a) shows the 
plot of depths and radii of one of the tombs from Crete in Greece. The data appear to 
be piecewise log-linear, with possibly two or three change-points. 



Figure 11.11 near here. 



Finite mixture models: Mixture models are commonly used where each data observation 
is generated according to some underlying categorical mechanism. This mechanism is 
typically unobserved, so there is uncertainty regarding which component of the resulting 
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mixture distribution each data observation was generated from, in addition to uncer- 
tainty over the number of mixture components. A mixture model with k components 
for the observed data x takes the form 



:i.l.5) 



with = (0^, . . . , (f))^), where Wj is the weight of the j mixture component fj, whose 
parameter vector is denoted by (/) •, and where Yl^j=i'^j ~ ^- The number of mixture 



components, k, is also unknown. 

Figure ll-lfb) illustrates the dist r ibutio n of enzymatic activity in the blood for 245 indi- 



viduals. 



Richardson and GreenI (Il997f ) analysed these data using a mixture of Normal 



densities to identify subgroups of slow or fast metabolizers. The multi-modal nature of 
the data s uggests the exi s tence of such groups, but the number of distinct groupings is 



less clear. 



Tadesse et al. 



( I2OO5I ) extend this Normal mixture model for the purpose of 



clustering high-dimensional data. 

Variable selection: The problem of variable selection arises when modelling the relation- 
ship between a response variable, Y, and p potential explanatory variables xi, . . .Xp. 
The multi-model setting emerges when attempting to identify the most relevant sub- 
sets of predictors, making it a natural candidate for the reversible jump sampler. For 
example, under a regression model with Normal errors we have 



Y = X^I3^ + e 



with 



e ~ 



iV(0,a2/) 



:i.l.6) 



where 7 = (71, ... , 7p) is a binary vector indexing the subset of Xi, . . . to be included 
in the linear model, is the design matrix whose columns correspond to the indexed 
subset given by 7, and is the corresponding subset of regr ession coefficients. For 
examp l es and algorithm s in th is se tting and beyond see e .g. iGeorge and McCulloch 
(ll993Usmith and KohnI mm and hott and Leontel \mk- 



Non-parametrics: Within Bayesian non-parametrics, many authors have successfully ex- 
plored the use of the reversible jump sampler as a method to autom ate the knot se l ection 



process when using a P-th order spline model for curve fitting (jPenison et al. 



1998 
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DiMatteo et al.l . 1200 ll ). Here, a curve / is estimated by 



P k 

f{x) = ao + ajX^ + y]i{x — Kj)^, x G [a, h] 



:i.i.7) 



2=1 



where = maxfO, z) and K j , i = l,...,k, represent the locations of k knot points 
( iHastie and Tibshiranil . Il990l ). Under this representation, fitting the curve consists of 
estimating the unknown number of knots k, the knot locations Ki and the corresponding 
regression coefficients aj and rji, for j = 0, . . . , P and i = 1, . . . ,k. 

Time series models: In the modelling of temporally dependent data, xi, . . . xt, multiple 
models naturally arise under uncertainty over the degree of dependence. For example, 
under a k-th order autoregressive process 



with 



t = k + l,...,T 



:i.l.8) 



r=l 



with et ~ iyA^(0,cr^), the orde r, k, of the a u toregre s sion is commonly unknow n, in 
addition to the coeffi cients Qr- iBrooks et al.l (l2003d ). lEhlers and Brooksl ( l2003l ) and 



Vermaak et al. 



(120041 ) each detail descriptions on the use of reversible jump samplers 



for this class of problems. 



The reversible jump algorithm has had a compelling influence in the statistical and main- 
stream scientiflc research literatures. In general, the lar ge majority of application areas have 



tended to be computationally or biologically related f lSissoru . |2005[ ). Accordingly a large 
number of developmental and application studies can be found in the signal processing lit- 
erature and the related flelds of computer vision and image analysis. Epidemiological and 
medical studies also feature strongly. 

This article is structured as follows: In Section 11.21 we present a detailed description of 
how to implement the reversible jump sampler and review methods to improve sampler per- 
formance. Section [L3] examines post-simulation analysis, including label switching problems 
when identiflability is an issue, and convergence assessment. In Section [L4l we review related 
sampling methods in the statistical literature, and conclude with discussion on possible fu- 
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ture rese arch di r ection s for the field. Oth er useful reviews of reversible jump MCMC can be 
found in Icreenl hoom and|sisson| koom . 



1.2 Implementation 

In practice, the construction of proposal moves between different models is achieved via the 
concept of "dimension matching". Most simply, under a general Bayesian model determi- 
nation setting, suppose that we are currently in state {k, 6^) in model M.ki and we wish to 
propose a move to a state {k',0'f^,) in model Aik', which is of a higher dimension, so that 
f^k' > In order to "match dimensions" between the two model states, a random vector 
u of length dk_^k' = — is generated from a known density Qd^^y The current state 
Ok and the random vector u are then mapped to the new state 6'^,, = gk^k'{Gk, u) through a 
one-to-one mapping function Qk-^k' '■ "^"^ x T^'^'" 7^"*' . The acceptance probability of this 
proposal, combined with the joint posterior expression of Equation fll.l.ip becomes 



a[ik,Ok),ik',0',,)]=mmll, 



n{k',ei, 


x)q{k' — )■ k) 


dgk^k'i0k,u) 


7r{k,0k 


x)q{k -> k')qd^^^,{u) 


d{Ok,u) 



:i.2.i) 



where q{k — k') denotes the probability of proposing a move from model Aik to model 
Aik', and the final term is the determinant of the Jacobian matrix, often referred to in the 
reversible jump literature simply as the Jacobian. This term arises through the change of 
variables via the function Qk-^k'-, which is required when used with respect to the integral 
equation fll.l.Sp . Note that the normalisation constant in Equation (11. 1.11) is not needed 
to evaluate the above ratio. The reverse move proposal, from model Aiy to M.k is made 
deterministically in this setting, and is accepted with probability 



a[{k\e',,),{k,ek)] = a[{k,ek),{k\e',,)]-\ 



More generally, we can relax the condition on the length of the vector u by allowing dk^k' > 
nk' — nk- In this case, non-deterministic reverse moves can be made by generating a dk'^k- 
dimensional random vector u' ~ q(iy^^,{u'), such that the dimension matching condition, 
Uk + dk^k' = nk' + dk'^k, is satisfied. Then a reverse mapping is given by Ok = gk'^k{0'k': 
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such that Ok = gk'-^k{gk-^k'i0k,u),u') and 6'^, = gk^k'igk'^k{0'k',u'),u). The corresponding 
acceptance probabihty to Equation fll.2.ip then becomes 



a[{k, 6k), {k\ 0'^,)] = min <^ 1 



n{k',e'f^, I x)q{k' k)qd..Au' 



Tr{k,6k I x)q{k -)> k')qd 
Example: Dimension matching 



dgk-,k'{Ok,u) 



d{Ok,u) 



[1.2.2) 



Green 


(1995 


) and 


Brooks 


(1998) 



model Ail has states {k = l,6i E IZ^) and model A^2 has states {k = 2,^2 G 'R?). Let 
(1,^*) denote the current state in Aii and (2, {6^^\6^'^'>)) denote the proposed state in A^2- 
Under dimension matching, we might generate a random scalar m, and let 6^^^ = 6* +u and 

reverse move given deterministically by 9* = ^{9^^^ + ^(2)). 

Example: Moment matching in a finite mixture of univariate Normals 

Under the finite mixture of univariate Normals model, the observed data, x, has density 
given by Equation fll.l.Sp . where the j-th mixtur e component fj(x \ 4>j) = (j)(x \ ^ij^Uj) is 
the N{fij,aj) density. For between- model moves, iRichardson and Green! (119971 ) implement 
a split (one component into two) and merge (two components into one) strategy which 
satisfies the dimension matching requirement. (See iDellaportas and Papageorgiou J2OO6I ) 
for an alternative approach.) 



W 



(11997f ) 



len two Normal components ji and ^2 are merged into one, j*, [Richardson and Green 
propose a deterministic mapping which maintains the 0"", 1°' and 2°'' moments: 



,- ■ 4: 



[1.2.3) 
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The split move is proposed as 



Wj^ = Wj* * Ui, Wj^ = Wj* * [I — Ui^ 



N2 = +U2(rj'J^ (1.2.4) 



J2 



4 = «3(i-w2)45: 

< = (1-U3)(l-Ui)4^ 



'02 



where the random scalars ui, U2 ~ Beta(2, 2) and ^3 ~ Beta(l, 1). In this manner, dimension 
matching is satisfied, and the acceptance probabihty for the spht move is calculated according 
to Equation fll.2.ip . with the acceptance probability of the reverse merge move given by the 
reciprocal of this value. 



1.2.1 Mapping functions and proposal distributions 

While the ideas behind dimension matching are conceptually simple, their implementation 
is complicated by the arbitrariness of the mapping function Qk^k' and the proposal distri- 
butions, qd^^y{u)., for the random vectors u. Since mapping functions effectively express 
functional relationships between the parameters of different models, good mapping functions 
will clearly improve sampler performance in terms of between-model acceptance rates and 
chain mixing. The difficulty is that even in the simpler setting of nested models, good re- 
lationships can be hard to define, and in more general settings, parameter vectors between 
models may not be obviously comparable. 

The only additional degree of freedom to improve between-model proposals is by choos- 
ing the form and parameters of the proposal distribution qdi^^y{u). However, there are no 
obvious criteria to guide this choice. Contrast this to within-model, random-walk Metropolis- 
Hastings moves on a continuous target density, whereby proposed moves close to the current 
state can have an arbitrarily large acceptance probability, and proposed moves far from 
the current state have low acceptance probabilities. This concept of "local" moves may be 
partially translated on to model space [k G /C): proposals from Ok in model Aik to O'j^, in 
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model Aik' will tend to have larger acceptance probabilities if their likelihood values are 
similar i .e. L(x \ k,0k) ~ L(x Ik^^^O'i^,). For example, in the analysis of Bayesian mixture 



models, 



Richardson and GreenI (119971 ) propose "birth/death" and "split /merge" mappings 



of mixture components for the between-model move, while keeping other components un- 
changed. Hence the proposed moves necessarily will have similar likelihood values to the 
current state. However, in general the notion of "local" move proposals does not easily ex- 
tend to the parameter vectors of different models, unless considering simplified settings (e.g. 
nested models). In the general case, good mixing properties are achieved by the alignment 
of regions of high posterior probability between models. 

Notwithstanding these difficulties, reversible jump MCMC is often associated with poor 
sampler performance. However, failure to realise acceptable sampler performance should 
only be considered a result of poorly constructed between-model mappings or inappropri- 
ate proposal distributions. It should even be anticipated that implementing a multi-model 
sampler may result in improved chain mixing, even when the inferential target distribution 
is a single model. In this case, sampling from a single model posterior with an "overly- 
sophisticated" machinery is loosely analogous with the extra performance gained with aug- 
mented state space sampling methods . For example, in the case of a finite mixture of Normal 



distributions. 



Richardson and Green! (119971 ) report markedly superior sampler mixing when 



conditioning on there being exactly three mi xture compon e nts, in comparison with the out- 
put generated by a fixed-dimension sampler. iGeorge et al.l (119991 ) similarly obtain improved 
chain performance in a single model, by performing "birth- then-d e ath" moves simultane- 
ously so that the dimension of the model remains constant. I GreenI (12003! ) presents a short 



study on which inferential circumstances determine whe ther the adoptio n of a multi-model 
sampler may be beneficial in this manner. Conversely, iHan and CarlinI (120011 ) provide an 



argument to suggest that multi-model sampling may have a detrimental effect on efficiency. 



1.2.2 Marginalisation and augmentation 

Depending on the aim or the complexity of a multi-model analysis, it may be that use of re- 
versible jump MCMC would be somewhat heavy-handed, when reduced- or fixed- dimensional 
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samplers may be substituted. In some Bayesian model selection settings, between-model 
moves can be greatly simplified or even avoided if one is prepared to make certain prior 
assumptions, such as conjugacy or objective prior specifications. In such cases, it may be 
possible to analytically integrate out some or all of the parameters 6k in the posterior distri- 
bution (ll.l.ip . reducing the sampler either to fixed dimensions, e.g. on model space G /C 



only, or to a lower-diinensional set of model and para r neter space (jBerger and Pericchi 



DiMatteo et al. 



2001 



George and McCuUoch 



1993 



Tadesse et al. 



2001 



20051 ). In lower dimen- 



sions, the reversible jump sampler is often easier to implement, as the problems associated 
with mapping function specification are conceptually simpler to resolve. 

Example: Marginalisation in variable selection 

In Bayesian variable selection for Normal linear models (Equation ll.l.6p . the vector 7 = 
(71, . . . ,7p) is treated as an auxiliary (model indicator) variable, where 



1 if predictor Xj is included in the regression 
otherwise. 



Under certain prior specifications for the regression coefficients /3 and error variance cr^, the 
/3 coefficients can be analytically integ rated out of the posterior. A Gibbs sampler d i rectly 



on model space is then available for 7 (IGeorge and McCuUoch 



1993 



Smith and Kohn 



19961). 



Nott and Greenl . 



2004 



Example: Marginalisation in finite mixture of multivariate Normal models 

Within t he context of c l usteri ng, the parameters of the Normal components are usually not of 



interest. 



Tadesse et al. 



( 120051 ) demonstrate that by choosing appropriate prior distributions, 
the parameters of the Normal components can be analytically integrated out of the posterior. 
The reversible jump sampler may then run on a much reduced parameter space, which is 
simpler and more efficient. 



In a general setting. 



Brooks et al 



fl2003d ) proposed a class of models based on augmenting 
the state space of the target posterior with an auxiliary set of state-dependent variables, Vk, 
so that the state space of 7i{k,0k,Vk \ x) = 7i{k,6k \ x)Tk{vk) is of constant dimension for 
all models Aik G M.- By updating Vk via a (deliberately) slowly mixing Markov chain, a 
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temporal memory is induced that persists in the from state to state. In this manner, the 
motivation behind the auxihary variables is to im prove betweeii - model proposals, in that 
some memory of previous model states is retained. iBrooks et al.l ( l2003d ) demonstrate that 



this approach can significantly enhance mixing compared to an unassisted reversible jump 
sampler. Although the fixed dimensionality of {k, 6k, Vk) is later relax e d, there is an o 



DVIOUS 



analogue with product space sampling frameworks (jCarlin and Chib 
See Section [1. 4. 2[ 



1995 



GodsiU . 



20011 ) 



Liu et al. 



All alternative augmented state space modification of standard MCMC is given by[ 
(j200ll ). The dynamic weighting algorithm augments the original state space by a weight- 
ing factor, which permits the Markov chain to make large transitions not allowable by the 
standard transition rules, subject to the computation of the correct weighting factor. Infer- 
ence is then made by using the weights to compute importance sampling estimates rather 
than simple Monte Carlo estimates. This method can be used within the reversible jump 
algorithm to facilitate cross-model jumps. 



1.2.3 Centering and order methods 



Brooks et al.l (l2003d ) introduce a class of methods to achieve the automatic scaling of the 
proposal density, qdf.^y{u), based on "local" move proposal distributions, which are centered 
around the point of equal likelihood values under current and proposed models. Under 
this scheme, it is assumed that local mapping functions Qk-^k' are known. For a proposed 
move from {k,Ok) in Aik to model J^k', the random vector "centering point" Ck^wiGk) = 
gk->k'{Gk, u), is defined such that for some particular choice of proposal vector u, the current 
and proposed states are identical in terms of likelihood contribution i.e. L{x \ k, 0k) = L{x \ 
k', Ck-^k'{Gk))- For example, if Aik is an autoregressive model of order k (Equation ll.1.81) and 
A^fc/ is an autoregressive model of order k' = k + 1, and if Ck^k'{Gk) = gk^k'{Gk, u) = {Ok, u) 
(e.g. a local "birth" proposal), then we have n = and Ck-^k' = (^fc,0), as L{x \ k,6k) = 
L{x I fc',(6>fc,0)). 

Given the centering constraint on u, if the scaling parameter in the proposal (id^_^y {u) 
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is a scalar, then the 0*'*-order method (jBrooks et aLl . l2003cl ) proposes to choose this scahng 
parameter such that the acceptance probabihty a[{k,Ok),{k',Ck^kr{Ok))] of a move to the 
centering point Ck^k'i^k) in model A^^/ is exactly one. The argument is then that move 
proposals close to Ck^k'i^k) will also have a large acceptance probability. 

For proposal distributions, Qd^^^ (u), with additional degrees of freedom, a similar method 
based on a series of ra'^^-order conditions (for n > 1), requires that for the proposed move, 
the n*'' derivative (with respect to u) of the acceptance probability equals the zero vector at 
the centering point Ck^ki[6k): 



V"ap,6l,),(A;',c,^fc,(6lfc))] = 0. 



:i.2.5) 



That is, the m unknown parameters in the proposal distribution qd^.^y (u) are determined 
by solving the m simultaneous equations given by fll.2.5p with n = l,...,m. The idea 
behind the n^'^-order method is that the concept of closeness to the centering point under 
the 0*''-order method is relaxed. By enforcing zero derivatives of a[{k,0k), {k' ,Ck~^k'{0k))], 
the acceptance probability will become flatter around Ck^k'i^k)- Accordingly this allows 
proposals further away from the centering point to still be accepted with a reasonably high 
probability. This will ultimately induce improved chain mixing. 

With these methods, proposal distribution parameters are adapted to the current state 
of the chain, {k,0k), rather than relying on a constant proposal parameter vector for all 
state transitions. It can be shown that for a simple two m odel case, the n^^-order conditions 
are optimal in terms of th e capacitance of the algorithm (ILawler and Sokal 



Ehlers and Brooks! (120031 ) for an extension to a more general setting, and 
(120031 ) for a centering method in the context of linear models. 



1988 



). See also 



Ntzoufras et al 



One caveat with the centering schemes is that they require specification of the between 
model mapping function gk^k', although these methods compensate for poor choices of map- 
ping functions by s electi ng the best set of parameters for the given mapping. Recently, 
Ehlers and Brooksl (120081 ) suggest the posterior conditional distribution Ti{k',u \ Ok) as the 
proposal for the random vector n, side-stepping the need to construct a mapping function. 
In this case, the full conditionals must either be known, or need to be approximated. 



14 CHAPTER 1. REVERSIBLE JUMP MCMC 



Example: 



Brooks et al. 



he 0^^- order method for an autoregressive model 



( l2003d ) considers the AR model with unknown order k (Equation ll.l.Sp . as- 
suming Gaussian noise et ~ A^(0,cr^) and a uniform prior on k where k = l,2,...kmax- 
Within each model Aik, independent A^(0,cr^) priors are adopted for the AR coefficients 
aT-,T = l,...,k, with an inverse gamma prior for a^. Suppose moves are made from 
model Aik to model Ai^' such that k' = k + 1. The move from 6^ to O'f^, is achieved 
by generating a random scalar u ~ q{u) = N{0, 1), and defining the mapping function as 
^'k' — 9k^k'{0k, u) = {Ok, cru). The centering point Ck^k'^O^) then occurs at the point u = 0, 
or 6',, = (6>fc,0). 

Under the mapping gk^k', the Jacobian is a, and the acceptance probability (Equa- 
tion [LlIT]) for the move from {k,6k) to {k' ,Ck^k'{Gk)) is given by a[{k,Ok),{k' ,{Ok,Q))] = 
min(l, A) where 

_ 7^(fc^ (0fc,O) I x)q{k' -> k)a _ {2n aiy'/^q{k' -> k)a 
7r{k,ek\x)q{k^k')q{0) q{k ^ k'){2^)-^/^ ' 

Note that since the likelihoods are equal at the centering point, and the priors common to 
both models cancel in the posterior ratio, A is only a function of the prior density for the 
parameter a^+i evaluated at 0, the proposal distributions and the Jacobian. Hence we solve 
y4 = 1 to obtain 

2 2 

cr = at 



q{k ^ k') 
q{k' -> k) 

Thus in this case, the proposal variance is not model parameter [Oj.) or data {x) dependent. 
It depends only on the prior variance, o"a, and the model states, fc, k' . 

Example: The second-order method for moment matching 

Consider the moment matching in a finite mixture of univariate Normals example of Section 
11.21 The mapping functions gy^k and gk^y are respectively given by Equations fll.2.3p and 
f ll.2.4p . with the random numbers Ui, U2 and ^3 drawn from independent Beta distributions 
with unknown parameter values, so that qpi,qi{ui): Ui ~ Beta(pj,gi), i = 1,2,3. 



Coii sider the split move. Equation f ll.2.4p . To apply the second order method of lBrooks et al 
( l2003d ). we first locate a centering point, Ck^k'i^k), achieved by setting mi = 1, M2 = and 



1.2. IMPLEMENTATION 



15 



M3 = Ml = 1 by inspection. Hence, at the centering point, the two new (spht) components 
ji and j2 will have the same location and scale as the j* component, with new weights 
Wj-^ = Wj* and = and all observations allocated to component ji. Accordingly this will 
produce identical likelihood contributions. Note that to obtain equal variances for the split 
proposal, substitute the expressions for Wj-^ and Wj^ into those for aj^ = aj^- 



Following iRichardson and GreenI (119971 ). the acceptance probability of the split move 



evaluated at the centering point is then proportional (with respect to u) to 
log A[{k, 6 k),{k',Ck^k' (Ok))] oc 

Iji log(Wii) + lj2 ^Og{Wj,) - ^ log((T|J - ^ log(cT|J - 2^ T!llliyi - ^^31? 

h 

-2^^ Y!i=i{yi - N2? + (5 - 1 + In) log(«^ii) + (5 - 1 + In) log(^j2) 

-logbpi,gi(^i)] - log[gp2,g2(n2)] - log[gp3,q3(n3)] + logd^j, - flj^l) 
+ log(a|J + log(a|^) - log(n2) - log(l - u^) - hgiu^) - log(l - ug), 

(1.2.6) 

where Ij-^ and Ij^ respectively denote the number of observations allo cated to components j^ 
and jo , and where S,a, (3,^ and n are hyperparameters as defined by 



Richardson and Green 



fll997h . 



Thus, for example, to obtain the proposal parameter values pi and qi for ui, we solve the 
first- and second-order derivatives of the acceptance probability fll.2.6p with respect to ui. 
This yields 



d log a[{k, e k), {k', Ck^k' {0 k))] _ 6 + 21 - pi qi - 6 - 21 



32 



dUi Ui (1 — Ml) 

dHoga[ik,0k),ik',Ck^k'i0k))] ^ S + 2lj, - pi ^ qi-5- 2/,, 
duf u\ (1 — uiY 

Equating these to zero and solving for pi and qi at the centering points (with Ij^ = Ij* and 
lj2 = 0) gives pi = 6 + 2lj* and qi = 6. Thus the parameter pi depends on the number of 
observations allocated to the component being split. Similar calculations to the above give 
solutions for p2, q2,P3 and ^3. 
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1.2.4 Multi-step proposals 



Green and Miral (120011 ) introduce a procedure for learning fr om rejected between-mod el pro- 
posals based on an extension of the splitting rejection idea of iTierney and Miral (119991 ) . After 
rejecting a between-model proposal, the procedure makes a second proposal, usually under 
a modified proposal mechanism, and potentially dependent on the value of the rejected pro- 
posal. In this manner, a limited form of adaptive behaviour may be incorporated into the 
proposals. The procedure is implemented via a modified Metropolis-H astings acceptanc e 



probability, and may be extended to more than one sequential rejection (ITrias et al 



20091) 



Delayed-rejection schemes can reduce the asymptotic varianc e of ergodic averages by reduc 



1973 



Tierney 



19981) 



ing the probability of the chain remaining in the same state (iPeskunl . 
however there is an obvious trade-off with the extra move construction and computation 
required. 



For clarity of exposition, in the remainder of this section we denote the current state of 
the Markov chain in model Aik by a; = {k,0k), and the first and second stage proposed 
states in model J^k' by y and z. Let y = (^[^^./(a;, ui) and z = gf\^,{x,u-i,U2) be the 
mappings of the current state and random vectors u\ ~ q"^^ ,("^^1) and U2 ~ g^^'' ,(1*2) 
into the proposed new states. For simplicity, we again consider the framework where the 
dimension of model Aik is smaller than that of model Mk' (i-e. nk> > Uk) and where the 
reverse move proposals are deterministic. The proposal from x io y is accepted with the 
usual acceptance probability 



ai{x,y) = min <^ 1, 



■K{y)q{k' k) 



Ti{x)q{k k')qf^^^X'^^) 



d{x,ui) 



If y is rejected, detailed balance for the move from a; to z is preserved with the acceptance 
probability 



a2{x, z) = min < 1, 



7i{z)q{k' -^k)[l-a,{y*,z)-'] 



n{x)q{k /c')gi'j_^^,K)gyj_^^,(w2)[l - ai{x,y)] 



(2) 



dgfl^,{x,ui,U2) 



9(£C, 1*1,^2) 



where y* = 5f^,^^,,(2;, wi). Note that the second stage proposal z = g'jf^k'i^^'^i^'^^) 



(2) 
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permitted to depend on the rejected first stage proposal y (a function of x and iti). 



Al-Awadfii et al 



(j2004f ) also acknowledge that an initial between-model 



In a similar vein, 

proposal x' = gk_^kr{x, u) may be poor, and seek to adjust the state x' to a region of higher 
posterior probabi l ity be fore taking the decision to accept or reject the proposal. Specifically, 



Al-Awadhi et al. 



(j2004j ) propose to initially evaluate the proposed move to x' in model Aik' 
through a density 7i*{x') rather than the usual 7r{x'). The authors suggest taking it* to be 
some tempered distribution tt* = ir'^, 7 > 1, such that the modes of vr* and vr are aligned. 

The algorithm then implements k > 1 fixed- dimension MCMC updates, generating states 
x' ^ x^ ^ . . . ^ x'^ = X* , with each step satisfying detailed balance with respect to tt*. 
This provides an opportunity for x* to move closer to the mode of it* (and therefore tt) than 
x'. The move from x in model Mk to the final state x* in model Mk' (with density 7r(a;*)) 
is finally accepted with probability 



a{x, X*) = min < 1, 



TT 



7r{x*)7r*{x')q{k' k) 
'x)TT*{x*)q{k k')qd^^^,{u) 



dgk^k'ix,u) 



d{x, u) 



The implied reverse move from model Aik' to model model Aik is conducted by taking the 
K moves with respect to tt* first, followed by the dimension changing move. 

Various extensions can easily be incorporated into this framework, such as using a se- 
quence of TT* distributions, resulting in a slightly modified acce ptance probability expression. 
For instance, the standard simulated annealing framework, iKirkpatrickl (1l984l ). provides 
an example of a sequence of distributions which encourage moves towards posterior mode. 
Clearly the choice of the distribution tt* can be crucial to the success of this strategy. As 
with all multi-step proposals, increased computational overheads are traded for potentially 
enhanced between-model mixing. 



1.2.5 Generic samplers 



The problem of efficiently constructing between-model mapping templates, Qk-^y^ with as- 
sociated random vector proposal densities, Qd^^^y, niay be approached from an alternative 
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perspective. Rather than relying on a user-specified mapping, one strategy would be to move 
towards a more generic proposal mechanism altogether. A clear benefit of generic between- 
model moves is that they may be equally be implemented for non-nested models. While the 
ideal of "black-box" between-model proposals are an attractive ideal, they currently remain 
on the research horizon. However, a number of automatic reversible jump MCMC samplers 
have been proposed. 



of 



Green! (120031) p roposed a reversible jump analogy of the random-walk Metropolis sampler 



Roberts! fl2003l ). Suppose that estimates of the first and second order moments of 0^ 
are available, for each of a small number of models, G /C, denoted by fx^. and B^BJ 
respectively, where is an Uk x ra^ matrix. In proposing a move from [k, 6k) to model 
A^fc/, a new parameter vector is proposed by 



01 



' fik' + Bk' [RB-\ek-tXk)T^' 
Bk'RBl^{Ok - fJ-k) 

B^\ek-fik) 

u 




if ny < Uk 
if rifc/ = Uk 

if ny > Uk 



where 



denotes the first m components of a vector, is a orthogonal matrix of order 



max{nfc, nfc'}, and u ~ qn^,^ni^{u) is an (n^' — nfc)-dimensional random vector (only utilised 
if rik' > Uk, OT when calculating the acceptance probability of the reverse move from model 
Mk' to model Aik if "^k' < ^k)- If < ^k-, then the proposal 6'^ is deterministic and the 
Jacobian is trivially calculated. Hence the acceptance probability is given by 



a[{k,dk)Ak\e'k,)] 





\x) q{k' k) \Bk'\ 


7r{k,0k\ 


x) q{k k') 


Bk\ 



X 



[u) for Hk' < Uk 
for Uk' = Uk 

l/qn^,-nk{u) for Uk' > Uk 



1 



Accordingly, if the model-specific densities 7r(/c, 0fe|a;) are uni-modal with first and second 
order moments given by ^k ^"^^ BkBj, then high between- model acceptance probabilities 
may be ach ieved. ( Unita r y acceptance p roba bilities are av ailable if the ir^k, 0k\x) are exactly 



Gaussian). !Green! f!2003h . !Godsilll toom and 



Hastid (12004! ) discuss a number of modifications 



to this general framework, including improving efficiency and relaxing the requirement of 



1.2. IMPLEMENTATION 



19 



unimodal densities 7r{k,0k\x) to realise high between-model acceptance rates. Naturally, 
the required knowledge of first and second order moments of each model density will restrict 
the applicability of these approaches to moderate numbers of candidate models if these 
require estimation (e.g. via pilot chains). 



With a similar motivation to the above, 



Papathomas et al.l ( l2009l ) propose the multivariate 



Normal as proposal distribution for 0'^, in the context of linear regression models, so that 
6')^, ~ A^(/x^,|0^, The authors derive estimates for the mean fJ.k'\et covariance 

Sfe'|0j. such that the proposed values for will on average produce similar conditional 
posterior values under model Aik' as the vector 6^ under model Aik- In particular, consider 
the Normal linear model in Equation fll.l.6p . re- writing the error covariance as V, assuming 
equality under the two models such that Vk = Vk' = V. The parameters of the proposal 
distribution for 0'^, are then given by 

t^k'ie, = {X^,V-'Xy)-^X:;,V-^{Y + B-^V-'/'{X,0k-PkY)} 

Sfc'lflfe = Qk',k' — Q k' ,k'Q k']kQ k,kQ kJt'Q k' ,k' + c/„j., 

where 7 and 7' are indicators corresponding to models Aik and ^Ak', B = (y+XySfc/|0^Xy 
Pk = X^{X^V-'^X^)-'^X^V-'^, Qk,k' = (X^V-^Xyy^, /„, is the n X n identity matrix and 
c > 0. Intuitively, the mean of this proposal distribution may be interpreted as the maximum 
likelihood estimate of O'l^, for model A4k', plus a correction term based on the distance of 
the current chain state 0^ to the mode of the posterior density in model Aik- The mapping 
between 0'^, and 0^ and the random number u is given by 



a' — _L 

where u ~ A^(0, In,.,)- Accordingly the Jacobian corresponding to Equation fll.2.2p is given by 



a/2 



. Under this construction, the value c > is treated as a tuning parameter for 
the calibration of the acceptance probability. Quite clearly, the parameters of the between- 
model proposal do not require a priori estimation, and they adapt to the current state of 
the chain. The authors note that in some instances, this method produces similar results 
in terms of efficiency as GreeJ ( 2003 ). One caveat is that the calculations at each proposal 



20 



CHAPTER 1. REVERSIBLE JUMP MCMC 



stage involve several inversions of matrices which can be computationally costly when the 
dimension is large. In addition, the method is theoretically justified for Normal linear models, 
but can be applied to non- Normal models when tran sformation of data to Normality is 



available, as demonstrated in 



Papathomas et al.l ( 120091 ) 



Fan et al. 



(120091 ) propose to construct between-model proposals based on estimating con- 
ditional marginal densities. Suppose that it is reasonable to assume some structural similar- 
ities between the parameters dk and of models J^k and M.k' respectively. Let c indicate 
the subset of the vectors 0^ = {01,0'^^^) and O'l^, = {01, , O]^,'^) which can be kept constant 
between models, so that 91, = O^. The remaining r-dimensional vector 0~^^ is then sampled 
from an estimate of the factorisation of the conditional posterior of ^^/'^ = {6\,, . . . ,91.,) under 
model J^k'- 



r-iydl, ^ 



9l,,el,,x)TTr 



Ol„x) 



The proposal Bj^,'^ is drawn by first estimating ^r(^fc' I ^k'l'^) ^"^^ sampling 6*^,, and by then 



estimating vr. 



nr-l 



9 L, Oj,, x) a n d sarn pling 9^, , conditioning on the previously sampled 



point, 91,, and so on. 



Fan et al. 



(120091 ) construct the conditional marginal densities by 
using partial derivatives of the joint density, nik' ,6'^ | a;), to provide gradient information 
within a marginal density estimator. As the conditional marginal density estimators are 
constructed using a combination of samples from the prior distribution and gridded values, 
they can be computationally expensive to construct, particularly if high-dimensional moves 
are attempted e.g. 6~^f^ = 6^,. However, this approach can be efficient, and also adapts to 
the current state of the sampler. 



1.3 Post simulation 
1.3.1 Label switching 

The so-called "label switching" problem occurs when the posterior distribution is invariant 
under permutations in the labelling of the parameters. This results in the parameters having 
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identical marginal posterior distributions. For example, in the context of a finite mixture 
model (Equation ll.l.Sp . the parameters of each mixture component, (fij, are unidentifiable 
under a symmetric prior. This causes problems in the interpretation of the MCMC output. 
While this problem is general, in that it is not restricted to the multi-model case, as many 
applications of the reversible jump sampler encounter this type of problem, we discuss some 
methods of overcoming this issue below. 

The conceptually simplest method of circumventing nonidentifiability is to impose artifi- 
cial constraints on the parameters. For example, if Hj denotes the m ean of the ?-th Gaussian 



mixture component, then one such constraint could he fii < . . . < fij. (IRichardson and Green 



1997 ). However, the effectiveness of this approach is not always guaranteed (jJasra et al. 



20051 ). One of the main problems with such constraints is that they are often artificial, be- 
ing imposed for inferential convenience rather than as a result of genuine knowledge about 
the model. Furtherrnqre, su itable constraints can be difficult or almost impossible to find 
jFruhwirth-Schnatterl . boOlf ). 



Alter native approache s to handling nonidentifiability involve the post-processing of MCMC 
StephensI ( 2000bl ) gives an inferential method based on the relabelling o 



output. 



with r e spect to the permu tatio n which minimise s the p osterior expected loss. 



fl2000h 



Hurn et al. 



components 



Celeux et al. 



(120031 ) and ISisson and Hurd fl2004l ) adopt a fully decision-theoretic ap- 
proach, where for every posterior quantity of interest, an appropriate (possibly multi-model) 
loss function is constructed and minimised. Each of these methods can be computationally 



expensive. 



1.3.2 Convergence assessment 

Under the assumption that an acceptably efficient method of constructing a reversible jump 
sampler is available, one obvious pre-requisite to inference is that the Markov chain converges 
to its equilibrium state. Even in fixed dimension problems, theoretical convergence bounds 
are in general difficult or impossible to determine. In the absence of such theoretical results, 
convergence diagnostics based on empirical statistics computed from the sample path of 
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multiple chains are often the only available tool. An obvious drawback of the empirical 
approach is that such diagnostics invariably fail to detect a lack of convergence when parts 
of the target distribution are missed entirely by all replicate chains . Accordingly, th e se are 
nece ssary rather than sufficien t indicators of chain convergence (see iMengersen et al.l (jl999[ ) 



and 



Cowles and CarlinI ( 119961 ) for comparative reviews under fixed dimension MCMC). 



The reversible jump sampler generates additional problems in the design of suitable em- 
pirical diagnostics, since most of these depend on the identification of suitable scalar statistics 
of the parameters sample paths. However, in the multi-model case, these statistics may no 
longer retain the same interpretation. In addition, convergence is not only required within 
each of a potentially large number of models, but also across models with respect to posterior 
model probabilities. 

One obvious approach would be the implementation of independent sub-chain assess- 
ments, both wit hin-models and for t he model indicator G /C. With focus purely on 



model selection, iBrooks et al. 



(j2003bl ) propose various diagnostics based on the sample- 
path of the model indicator, k, including non-parametric hypothesis tests such as the and 
Kolmogorov-Smirnov tests. In this manner, distributional assumptions of the models (but 
not the statistics) are circumvented at the price of associating marginal convergence of k 
with convergence of the full posterior density. 



Brooks and Giudicil (120001 ) propose the monitoring of functionals of parameters which re- 



tain their interpretations as the sampler moves between models. The deviance is suggested 
as a default choice in the absence of superior alternatives. A two-way ANOVA decompo- 
sition of the variance of such a functional is forme d over multiple chain repli cations, from 
which the potential scale reduction factor (PSRF) ( Gelman and Ruberu . Il992l ) can be con- 



structed and monitored. 



Castelloe and Zimmerman! (120021 ) extend this approach firstly to 



an unbalanced (weighted) two-way ANOVA, to prevent the PRSF being dominated by a 
few visits to rare models, with the weights bei ng specified in proportion to the frequency of 



model visits. 



Castelloe and Zimmerman! (120021 ) also extend their diagnostic to the multivari- 



ate (MANOVA) setting on the observation that monitoring several functionals of marginal 
parameter subsets is more robust than monitoring a single statistic. This general method is 
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clearly reliant on the identification of useful statistics to monitor, but is also sensitive to the 
extent of approximation induced by violations of the ANOVA assumptions of independence 
and normality. 



Sisson and FanI (120071 ) propose diagnos tics when the underlying model can be formulated 



in the marked point process framework ( iDiggld . Il983l : iStephend . |2000aJ). For example, a 
mixture of an unknown number of univariate normal densities (Equation ll.l.Sp can be rep- 
resented as a set of k events = {wj, fij, (t|), j = 1, . . . , fc, in a region A C M^. Given a 
reference point v G A, in the same space as the events C,j (e.g. v = (w, /i, cr^)), then the 
point-to-nearest-event distance, y, is the distance from the point (v) to the nearest event 
(^j) in A with respect to some distance measure. One can evaluate distributional aspects 
of the events {^j}, through y, as observed from different reference points v. A diagnostic 
can then be constructed based on comparisons between empirical distribution functions of 
the distances y, constructed from Markov chain sample-paths. Intuitively, as the Markov 
chains converge, the distribution functions for y constructed from replicate chains should be 
similar. 



This approach permits the direct comparison of full parameter vectors of varying dimen- 
sion and, as a result, naturally inc orporates a measure o f across model convergence. Due to 



the manner of their construction, 



Sisson and Fan 



(120071 ) are able to monitor an arbitrarily 



large number of such diagnostics. However, while this approach may have some appeal, it is 
limited by the need to construct the model in the marked point process setting. Common 
models which may be formulated in this framework include finite mixture, change point and 
regression models . 

Example: Convergence assessment for finite mixture univariate N ormals 



We consider the reversible jump sampler of 



Richardson and Green! (119971 ) implementing a 



finite mixture of Normals model (Equation 11.1.51) using the enzymatic activity dataset (Fig- 
ure 11.1( b)). For the purpose of assessing performance of the sampler, we implement five 
independent sampler replications of length 400,000 iterations. 



Figure [0] (a.b) illustrates the diagnostic of lBrooks et al.l (l2003bl ) which provides a test for 
between-chain convergence based on posterior model probabilities. The pairwise Kolmogorov- 
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Smirnov and (all chains sim ultaneously) tests assu me independent realisations. Based on 
the estimated convergence rate. iBrooks et al.l (l2003bl ). we retain every 400th iteration to ob- 



tain approximate independence. The Kolmogorov-Smirnov statistic cannot reject immediate 
convergence, with all pairwise chain comparisons well above the critical value of 0.05. The 
statistic cannot reject convergence after the first 10,000 iterations. 



Figure [L2l (c) illustrates the two multivariate PSRF's of lCastelloe and Zimmerman! (120021 ) 
using the deviance as the default statistic to monitor. The solid line shows the ratio of 
between- and within-chain variation; the broken line indicates the ratio of within-model 
variation, and the within-model, within-chain variation. The mPSRF's rapidly approach 1, 
suggesting convergence, beyond 166,0 00 iterations. This is supported by the independent 
analysis of Brooks and GiudicJ (2000) who demonstrate evidence for convergence of this 
sampler after around 150,000 iterations, although they caution that their chain lengths of 
only 200,000 iterations were too short for certainty. 



Figure [L2] (d). adapted from lSisson and FanI (120071 ). illustrates the PSRF of the distances 
from each of 100 randomly chosen reference points to the nearest model components, over the 
five replicate chains. Up to around 100,000 iterations, between-chain variation is still reduc- 
ing; beyond 300,000 iterations, differences between the chains appear to have stabilised. The 
intervening iterations mark a gradual transition between these two states. This diagnostic 
appears to be the most conservative of those presented here. 



Figure 11.21 near here. 



This example highlights that empirical convergence assessment tools often give varying 
estimates of when convergence may have been achieved. As a result, it may be prudent to 
follow the most conservative estimates in practice. While it is undeniable that the benefits 
for the practitioner in implementing reversible jump sampling schemes are immense, it is 
arguable that the practical importance of ensuring chain convergence is often overlooked. 
However, it is also likely that current diagnostic methods are insufficiently advanced to 
permit a more rigourous default assessment of sampler convergence. 
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One of the useful by-products of the reversible jump sampler, is the ease with which Bayes 
factors can be estimated. Explicitly expressing marginal or predictive densities of x under 
model A^fc as 



mk{x) = / L{x\k,6k)p{0k \ k)dOk, 
the normalised posterior probability of model A^a: is given by 

p{k)mk{x) 



p{k I x) 



Y.k'eKPi^')^k'{x) 



/ ^ /, X -Pfc'.fc 



p{k) 



where By^^ = rnk'{x)/mk{x) is the Bayes factor of model Aik' to Aik, and p{k) is the 

ectiq n techniques, see 



Ghosh and Samanta 



prior probability of m o del Aik.. For a discu s sion o f Bayesian model s e ^ 
Chipman et al.l fl200l[).lBerger and Pericchil fl200l[).lKass and Raftervi (jl995[ l. 
( I2OOII ). iBerger and Pericchil (120041 ). iBarbieri and Bergerl ( l2004l ). A usual estimator of the 
posterior model probability, p{k \ a;), is given by the proportion of chain iterations the 
reversible jump sampler spent in model Mk- 

However, when the number of candidate models is large, the use of reversible jump 
MCMC algorithms to evaluate Bayes factors raises issues of efficiency. Suppose that model 
Aik accounts for a large proportion of posterior mass. In attempting a between-model 
move from model Aik, the reversible jump algorithm will tend to persist in this model and 
visit others models rarely. Conseque ntly, estimates of Bayes factors based on model- visi t 



proportions will tend to be inefficient (IBartolucci and Scaccia 



2003 



Han and Carlinl . 120011 ). 



Bartolucci et al. 



(120061 ) propose enlarging the parameter space of the models under com- 
parison with the same auxiliary variables, u ~ qd^^yiu) and u' ~ qdy^^iu') (see Equation 
ll.2.2p . defined under the between-model transitions, so that the enlarged spaces, {Ok, u) and 
{dki, u'), have the same dimension. In this setting, an extension to the Br idge estimator for 



the e stimation of the ratio of normalising constants of two distributions (IMeng and Wong . 



19961 ) can be used, by integrating out the auxiliary random process (i.e. u and u') involved 



in the between- model moves. Accordingly, the Bayes factor of model Aik' to Aik can be 
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estimated using the reversible jump acceptance probabilities as 



Bk' 



k'k 



where a^^^[{k, Ok), {k\ O'f^,)] is the acceptance probability (Equation ll.2.2p of the j-th attempt 
to move from model M.k to M.k'i and where Jk and are the number of proposed moves 
from model A^^ to Aik' and vice versa during the simulation. Further manipulation is 
required to estimate Bk' k if the sampler does not jump between models A4.k and M.k' directly 



( Bartolucci et al. 



20061 ). This approach can provide a more efficient way of postprocessing 



reversible jump MCMC with minimal computational effort. 



1.4 Related multi-model sampling methods 

Several alternative multi-model sampling methods are available. Some of these are closely 
related to the reversible jump MCMC algorithm, or include reversible jump as a special case. 



1.4.1 Jump diffusion 



Before the development of the reversible jump sampler, iGrenander and Milled (119941 ) pro- 
posed a sampling strategy based on continuous time jump- diffusion dynamics. This process 
combines jumps between models at random times, and within-model updates based on a 
diffusion process according to a Langevin stochastic differential equation indexed by time, t, 
satisfying 

dei = dBl + ]^v\og'K{e\)dt 

where dB\ denotes an increment of Brownian motion, and V the vector of partial derivatives. 



'his methoc 



flMiller et al. 



las found some application in s ignal processing and other Bayesian analyses 



1995 



Phillips and Smithl . Il996l ). but has in general been superceded by the 



more accessible reversible jump sampler. In practice, the continuous-time diffusion must be 
approximated by a discrete-time simulation. If the time-discretisation is corrected for via 
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a Metropolis-Hastings acceptance probability, t 
an implementation of reversible jump MCMC 



le jump- diff usion sampler actually results in 



dBesag 



1994h 



1.4.2 Product space formulations 

As an alternative to samplers designed for implementation on unions of model spaces, 
© = Ufce/c({^}' ^"'°)' ^ number "super-model" product-space frameworks have been de- 
veloped, with a state space given by ©* = ^keK:{{k},'R.'^''). This setting encompasses all 
model spaces jointly, so that a sampler needs to simultaneously track 0k for all k & K,. The 
composite parameter vector, 0* G 0*, consisting of a concatenation of all parameters un- 
der all models, is of fixed- dimension, thereby circumventing the necessity of between-model 
transitions. Clearly, product-spac e samplers are limited t o situations where the dimension 
of 0* is computationally feasible. 



Carlin and Chib 



(119951 ) propose a posterior distribution 



for the composite model parameter and model indicator given by 

7r{k,0* I x) oc L{x I k,0*M0l I k)p{0* I 0l,k)p{k), 



where and X_fc are index sets respectively identifying and excluding the parameters 0k 
from 0*. Here Xk flXfc/ = for all k ^ k', so that the parameters for each m odel are distinct. 



Carlin and Chib 



It is e asy to see that the term p{0x_^. I ^x^^k)^ called a "psudo-prior" by 
19951 ). has no effect on the joint posterior 7i{k,02^ \ x) = 7r{k,0k \ x), and its form 



(Godsill. 


2003; 


Green, 


2003) 



Godsilll (120011 ) proposes a further generalisation of the above by relaxing the restriction 
that XfcflXfc/ = for all k ^ k'. That is, individual model parameter vectors are permitted to 
overlap arbitrarily, which is intuitive for, say, nested models. This framew ork can be shown 



Carlin and Chib 



to enc ompass the reversible jump algorithm, in addition to the setting of 
(119951 ). In theory this allows for direct comparison between the three samplers, although this 
has not yet been fully examined. However, one clear point is that the information contained 
within 0^ would be useful in generating efficient between-model transitions when in model 
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A4k, under a reversible jump sampler. This idea is exploited by iBrooks et al.l (l2003d ). 



1.4.3 Point process formulations 



A differen t perspe c tive on the raulti-model samp l er is b ased on spatial birth-and-death 



processes (iPreston 



1977 



Ripleyl . Il977l ). IStephend ( l2000al ) observed that particular multi- 



model statistical problem s can be represented as continuous time, marked point processes 
( iGeyer and MoUerl . Il994j ). One obvious setting is finite mixture modelling (Equation 11.1.51) 
where the birt h and death of m ixture components, (f)j, indicate transitions between models. 
The sampler of lStephensI (l2000al ) may be interpreted as a particular cont inuous time, limiting 
version of a sequence of reversible jump algorithms (ICappe et al.l . l2003l ). 



A number of illustrative comparisons of the reversible jump, jump-diffusion 



and p oi nt process framew o rks can be found in the 



( 2001 ) 



Del 



(I2003h and 



i teratu r e. See, 



or example. 



product space 



Andrieu et al. 



■aportas et al.l fl2002h . ICarlin and ChibI f ll995h . ICodsilll ( 120031 . l200lh 
StephenJ f 2000a ). 



Cappe et al 



1.4.4 Multi-model optimisation 



The reversible jump MCMC sampler may be utilised as the underlying random mechanism 
within a s tochastic optimisation framework, given its ability to traverse complex spaces 



efficiently (I Andrieu et al. 



2000 



Brooks et al 



2003al ). In a simulated annealing setting, the 



sampler would define a stationary distribution proportional to the Boltzmann distribution 



BTik,ek)(xexp{-f{k,ek)/T}, 

where T > and f{k,Ok), is a model-ranking function to be minimised. A stochastic 
annealing framework will then decrease the value of T according to some schedule while 
using the reversible jump sampler to explore function space. Assuming adequate chain 
mixing, as T — )■ the sampler and the Boltzmann distribution will converge to a point 
mass at {k*,0l.,) = argmax /(/c, 0^). Specifications for the model-ranking function may 
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include the AIC or BIG (King and Brooka l2004l : ISisson and Fanl . 12009( 1. the posterior model 



proba bihtv fIClvdel. 



space (ISisson and Hurn 



1999) or a non-standard loss function defined on variable-dimensional 



20041 ) for the derivation of Bayes rules. 



1.4.5 Population MCMC 



The population Markov chain Monte Carlo method fjLiang and Wong 



be extended to the reversible 



2001 



Lid, I2OO1I ) may 



ump s etting ( iJasra et al.l . 120071 ) . Motivated by simulated an- 
nealing (jGeyer and Thompson! . Il995l ). parallel reversible jump samplers are implemented 
targetting a sequence of related distributions = 1,...,N, which may be tempered 

versions of the distribution of interest, vti = iT{k,Ok \ x). The chains are allowed to in- 
teract, in that the states of any two neighbouring (in terms of the tempering parameter) 
chains may be exchanged, thereb y improving the m ixing across the population of samplers 



both within and between models. 



Jasra et al 



(120071 ) demonstrate superior convergence rates 



over a single reversible jump sampler. 



Gramacy et al 



or sa mplers that make use of tempering or parallel 



(|2009[ ) propose efficient methods of utilising samples 



simulation techniques, 

from all distributions (i.e. including those not from vri) using importance weights, for the 
calculation of given estimators. 



1.4.6 Multi-model sequential Monte Carlo 



The idea of running multiple samplers over a sequence of related distributions ni a y also con- 
sidere d under a sequential Monte Carlo (SMC) framework (jPel Moral et al.l . 120061 ). 



Jasra et al 



( I2OO8I ) propose implementing separate SMC samplers, each targetting a different subset 
of model-space. At some stage the samplers are allowed to interact and are combined into 
a single sampler. This approach permits more accurate exploration of models with lower 
posterior model probabilities than would be possible under a single sampler. As with pop- 
ulation MCMC methods, the benefits gained in implementing A^ samplers must be weighed 
against the extra computational overheads. 
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1.5 Some discussion and future directions 



Given the degree of complexity associated with the implementation of reversible jump 
MCMC, a major focus for future research is in designing simple, yet efficient samplers, with 
the ultimate goal of automation. Several authors have provided new insight on t he reversible 



l ump 



sampler which may contribute towards achieving such goals. For example. 



Keith et al. 



( 120041 ) prese nt a generalised Msn 



special case. 



Petris and Tardella 



^ ov sam pler, which includes the reversible jump sampler as a 
( I2OO3I ) demonstrate a geometric approach for sampling from 
nested models, formulated by drawing from a fixed- dimension auxiliary continuous distribu- 
tion on t he larges t mod el subspace, and then using transformations to recover model-specific 
samples. IWalkerl (j2009[ ) has recently provided a Gibbs sampler alte rnative to the reversible 
jump MCMC, using auxiliary variables. Additionally, as noted by ISissonl (120051 ). one does 



not need to work only with reversible Marko y chains, and that non-reversible chains may 



offer oppor tunities for sampler improvement ( jPiaconis et al. 



2000 



Neal 



20041 ). 



Mira and Geyer 



2000 



An alternative way of increasing sampler efficiency would be to explore the ideas intro- 
duced in adaptive MCMC. As with standard MCMC, any adaptations must be implemented 



with care - transition kernels dependent on the e ntire history o 



used under diminishing adaptation conditions (IHaario et al 



the Markov chain can only be 



2001 



Roberts and Rosenthal 



20091 ). Alternative schemes permit modification of the proposal distribution at regeneration 



times, when the next state of the Markov c h ain b ecomes completely independent of the past 



( iBrockwell and Kadane . 



2005 



Gilks et al. 



19981 ). Under the reversible jump framework. 



regeneration can be naturally achieved by incorporating an additional model, from which 
independent samples can be drawn. Under any adaptive scheme, however, how best to 
make use of historical chain information remains an open question. Additionally, efficiency 
gains through adaptations should naturally outweigh the costs of handling chain history and 
modification of the proposal mechanisms. 

Finally, two areas remain under-developed in the context of reversible jump simulation. 
The first of these is perfect simulation, which provides an MCMC framework for produc- 
ing samples exactly from the target distribution, circumventing convergence issues entirely 
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2006 



M0ller and Nicholls 



(iPropp and Wilsonl. Il996 ) . Som e tentative steps have been made in this area ( iBrooks et ah 



19991). Se condly, while the development of "likelihood-free" MCMC 



has received much recent attention (ISisson and Fan 



(120101 ). this volume), implementing the 



sampler in the multi-model setting remains a challenging problem, in terms of both compu- 
tational efficiency and bias of posterior model probabilities. 
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stylos data Enzyme data 




(a) (b) 

Figure 1.1: Examples of (a) change-point modelling and (b) mixture models. Plot (a): With the Stylos tombs 
dataset (crosses), a piecewise log-linear curve can be fitted between unknown change-points. Illustrated are 2 (solid 
line) and 3 (dashed line) change-points. Plot (b): The histogram of the enzymatic activity dataset suggests clear 
groupings of metabolizers, although the number of such groupings is not clear. 
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(c) (d) 

Figure 1 .2: ConverKence ass essment for the enzymatic activity dataset. Plots (a) Kolmogorov-Smirnov and (b) 
tests of iBrooks et alj (|2003lJ ). Horizontal li ne denotes an q = 0.05 significan ce level for test of dif ferent sampling 
distrib utions. Plots (c) multivariate PSRF's of lCastelloe and ZimmermanI (|2002l ^ and (d) PSRFv's of lSisson and FanI 
(|2003). Horizontal lines denote the value of each statistic under equal sampling distributions. 



